Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Tue Dec 01 21:00:09 2020"
"I am Ambrin from the University of Eastern Finland"
## [1] "I am Ambrin from the University of Eastern Finland"
"I am based in Kuopio."
## [1] "I am based in Kuopio."
"I am doing my PhD in Health Sciences"
## [1] "I am doing my PhD in Health Sciences"
"I am planning to learn basics of R and concepts about open data science here"
## [1] "I am planning to learn basics of R and concepts about open data science here"
"I am curious"
## [1] "I am curious"
"I heard about this course from the yammer platform of my University"
## [1] "I heard about this course from the yammer platform of my University"
"Its great to learn new things."
## [1] "Its great to learn new things."
"The link to my github repository is https://github.com/ambrinbabu/IODS-project/"
## [1] "The link to my github repository is https://github.com/ambrinbabu/IODS-project/"

The text continues here.

:/home/Ambrin$ git config –global Ambrin.email


Insert chapter 2 title here

Describe the work you have done this week and summarize your learning.

date()
## [1] "Tue Dec 01 21:00:09 2020"

We would like to read the students2014 data into R from the url provided and would like to explore the data.

#Read the students2014 data into R from url and store in students2014
students2014 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt",sep=",", header=TRUE )
head(students2014)
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21
#Explore the structure of the data 
str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The data is in the form of a data frame and as we can see, there are 7 columns of data representing gender, age, attitude, deep, stra, surf and points. There are 166 observations (rows). The data comes from ASSIST (The approaches and study skills inventory for students) and includes information about 116 students of different age groups and comprise both male and female and approaches they use for learning - surface approach (memorise without understanding with a serious lack of personal engagement in learning process), deep approach (intention to maximise understanding with a true commitment to learning) and strategic approach (apply any strategy that maximises the chance of chieving highest possible grades). The student achievments are measured by points in the exams

#Explore the dimensions of the data
dim(students2014)
## [1] 166   7
#ggplot2 is a popular library for creating stunning graphics with R.
#install.packages("ggplot2")
library(ggplot2)
#Show a graphical overview of the data 
p1 <- ggplot(students2014, aes(x = attitude, y = points, col = gender))

# define the visualization type (points)
p2 <- p1 + geom_point()


# add a regression line
p3 <- p2 + geom_smooth(method = "lm")


# add a main title and draw the plot
p4 <- p3+ ggtitle("Student's attitude versus exam points")
p4
## `geom_smooth()` using formula 'y ~ x'

Here we see a graphical overview of the data and see the relationship between the attitude vs exam points

#Show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them
##Simple regression

# a scatter plot of points versus attitude
library(ggplot2)
qplot(attitude, points, data = students2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

# fit a linear model
my_model <- lm(points ~ attitude, data = students2014)

# print out a summary of the model

summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

We see here a linear model fitted for our data.Linear regression model is a simple statistical model.It is an approach for modelling the relationship between a dependent variable y and one or more exploratory variables x.The model is found by minimising the sum of the residuals.Residuals are essentially the difference between the actual observed response values and the response values that the model predicted.

The exam points are the target variable and attitude is the explanatory variable. The variables are evenly distributed. The summary of the variables in the data is also shown.The Residuals section of the model output breaks it down into 5 summary points - min, 1Q, median, 2Q and max. The coefficients gives estimates for the parameters of the model.In this case the p value for attitude is very low. So there is a statistical relationship between attitude and points.

#Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it.
my_model2 <- lm(points ~ attitude + stra + surf, data = students2014)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Here 3 variables (attitude, strategic learning, surface learning) are explanatory variables and a multiple regression model is fitted where points is the target variable. There are 5 point summary of residuals of model - min, 1Q, median, 3Q and Max.

Below the residuals section, we see coefficients which gives estimates for the parameters of the model. The estimate corresponding to the intercept is the estimate of alpha parameter and the estimate corresponding to attitude is beta parameter. Here we estimated the effect of attitude on points to be 11.01 with standard difference of approx 3.68. We also have t and p values corresponding to statistics test of null hypothesis that the actual value of beta parameter would be 0. In this case the p value for attitude is very low. So there is a statistical relationship between attitude and points.However, for stra and surf, these values are not statistically significant and so there is no statistical relationship between stra (or surf) and points.

Residual Standard Error is the measure of the quality of a linear regression fit. Theoretically, every linear model is assumed to contain an error term E. Due to the presence of this error term, we are not capable of perfectly predicting our explanatory variable from the target variable. The Residual Standard Error is the average amount that the response will deviate from the true regression line. It’s also worth noting that the Residual Standard Error was calculated with 162 degrees of freedom. Simplistically, degrees of freedom are the number of data points that went into the estimation of the parameters used after taking into account these parameters.

The R-squared (R2) statistic provides a measure of how well the model is fitting the actual data. It takes the form of a proportion of variance. R2 is a measure of the linear relationship between our target variable and our explanatory variable. It always lies between 0 and 1 (i.e.: a number near 0 represents a regression that does not explain the variance in the response variable well and a number close to 1 does explain the observed variance in the response variable). In our example, the R2 we get is 0.2074. Or roughly 20% of the variance found in the explanatory variables can be explained by the target variable.

#If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it.
my_model2 <- lm(points ~ attitude, data = students2014)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

In the previous example, we saw that 2 of the explanatory variables (strategic and surface learning) did not have a statistically significant relationship with the target variable (points). Hence we removed the 2 variables and fitted the model again without it.

#Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra, data = students2014)
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
#1 residuals vs fitted values
#2 Normal QQplot
#5 Residuals vs levarage
par(mfrow = c(2,2))
plot(my_model2, which = 2)

Statistical models always include several assumption which describe the data generating process. In a linear regression model, we assume linearity. The target variable is modelled as a linear combination of the model parameters. Usually it is assumed that the errors are normally distributed, not correlated and have constant variance. Further, its also assumed that the size of a given error does not depend on the values of the explanatory variables.

QQ-plot: QQ plot of the residuals provides a meathod to explore the assumption that the errors of the model are normally distributed. The better the points fall within the line, the better is the fit to the normality assumption. In our case, we see a reasonable fit.

#Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra, data = students2014)
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
#1 residuals vs fitted values
#2 Normal QQplot
#5 Residuals vs levarage
par(mfrow = c(2,2))
plot(my_model2, which = 1)

The constant variance assumption implies that the size of the errors should not depend on the explanatory variables.This can be explored by plotting a scatter plot of residuals versus model predictors. In our case, we dont see when fitted values increase, spread of residuals increase, indicating a problem.

#Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra, data = students2014)
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
#1 residuals vs fitted values
#2 Normal QQplot
#5 Residuals vs levarage
par(mfrow = c(2,2))
plot(my_model2, which = 5)

Leverage of observations measures how much impact a single observation has on the model.Residuals vs leverage plot can help identify which observations have an unusually high impact.We do not have one particular point with very high leverage so we can conclude that it is a regular leverage without any outliers.

Chapter 3

Describe the work you have done this week and summarize your learning.

date()
## [1] "Tue Dec 01 21:00:12 2020"
# read data 
alc <- read.csv("https://github.com/rsund/IODS-project/raw/master/data/alc.csv")
# explore structure and dimension
str(alc)
## 'data.frame':    370 obs. of  51 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ address   : chr  "R" "R" "R" "R" ...
##  $ famsize   : chr  "GT3" "GT3" "GT3" "GT3" ...
##  $ Pstatus   : chr  "T" "T" "T" "T" ...
##  $ Medu      : int  1 1 2 2 3 3 3 2 3 3 ...
##  $ Fedu      : int  1 1 2 4 3 4 4 2 1 3 ...
##  $ Mjob      : chr  "at_home" "other" "at_home" "services" ...
##  $ Fjob      : chr  "other" "other" "other" "health" ...
##  $ reason    : chr  "home" "reputation" "reputation" "course" ...
##  $ guardian  : chr  "mother" "mother" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 2 1 2 2 2 1 ...
##  $ studytime : int  4 2 1 3 3 3 3 2 4 4 ...
##  $ schoolsup : chr  "yes" "yes" "yes" "yes" ...
##  $ famsup    : chr  "yes" "yes" "yes" "yes" ...
##  $ activities: chr  "yes" "no" "yes" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "yes" "yes" "no" "yes" ...
##  $ romantic  : chr  "no" "yes" "no" "no" ...
##  $ famrel    : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ freetime  : int  1 3 3 3 2 3 2 1 4 3 ...
##  $ goout     : int  2 4 1 2 1 2 2 3 2 3 ...
##  $ Dalc      : int  1 2 1 1 2 1 2 1 2 1 ...
##  $ Walc      : int  1 4 1 1 3 1 2 3 3 1 ...
##  $ health    : int  1 5 2 5 3 5 5 4 3 4 ...
##  $ n         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ id.p      : int  1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
##  $ id.m      : int  2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
##  $ failures  : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ paid      : chr  "yes" "no" "no" "no" ...
##  $ absences  : int  3 2 8 2 5 2 0 1 9 10 ...
##  $ G1        : int  10 10 14 10 12 12 11 10 16 10 ...
##  $ G2        : int  12 8 13 10 12 12 6 10 16 10 ...
##  $ G3        : int  12 8 12 9 12 12 6 10 16 10 ...
##  $ failures.p: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ paid.p    : chr  "yes" "no" "no" "no" ...
##  $ absences.p: int  4 2 8 2 2 2 0 0 6 10 ...
##  $ G1.p      : int  13 13 14 10 13 11 10 11 15 10 ...
##  $ G2.p      : int  13 11 13 11 13 12 11 10 15 10 ...
##  $ G3.p      : int  13 11 12 10 13 12 12 11 15 10 ...
##  $ failures.m: int  1 2 0 0 2 0 2 0 0 0 ...
##  $ paid.m    : chr  "yes" "no" "yes" "yes" ...
##  $ absences.m: int  2 2 8 2 8 2 0 2 12 10 ...
##  $ G1.m      : int  7 8 14 10 10 12 12 8 16 10 ...
##  $ G2.m      : int  10 6 13 9 10 12 0 9 16 11 ...
##  $ G3.m      : int  10 5 13 8 10 11 0 8 16 11 ...
##  $ alc_use   : num  1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ high_use  : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...
##  $ cid       : int  3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...
dim(alc)
## [1] 370  51

It is a dataframe with 370 observation and 51 variables. The data is about the secondary school in 2 portugese schools and their achievements. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por).The data contains information also about the different students, their age, sex etc. and if they consume high levels of alcohol or not. The alcohol consumption scale ranges from 1 (very low) to 5 (very high).

Hypothesis: My hypothesis is that many reasons affect the alcohol consumption.I have chosen 4 interesting variables - gender, famrel, goout (going out with friends) and freetime. Gender- I hypothesise that male drink more than women. famrel - When there are more family relations, less alcohol is consumed goout - when someone goes out with friends, more alcohol is consumed freetime - when there is more freetime, more alcohol is consumed

#Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots). Comment on your findings and compare the results of your exploration to your previously stated hypotheses.
library(tidyr); library(dplyr); library(ggplot2)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

alc %>% group_by(sex, high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups:   sex [2]
##   sex   high_use count
##   <chr> <lgl>    <int>
## 1 F     FALSE      154
## 2 F     TRUE        41
## 3 M     FALSE      105
## 4 M     TRUE        70
alc %>% group_by(failures, high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'failures' (override with `.groups` argument)
## # A tibble: 8 x 3
## # Groups:   failures [4]
##   failures high_use count
##      <int> <lgl>    <int>
## 1        0 FALSE      238
## 2        0 TRUE        87
## 3        1 FALSE       12
## 4        1 TRUE        12
## 5        2 FALSE        8
## 6        2 TRUE         9
## 7        3 FALSE        1
## 8        3 TRUE         3
alc %>% group_by(famrel, high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'famrel' (override with `.groups` argument)
## # A tibble: 10 x 3
## # Groups:   famrel [5]
##    famrel high_use count
##     <int> <lgl>    <int>
##  1      1 FALSE        6
##  2      1 TRUE         2
##  3      2 FALSE        9
##  4      2 TRUE         9
##  5      3 FALSE       39
##  6      3 TRUE        25
##  7      4 FALSE      128
##  8      4 TRUE        52
##  9      5 FALSE       77
## 10      5 TRUE        23
alc %>% group_by(goout, high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'goout' (override with `.groups` argument)
## # A tibble: 10 x 3
## # Groups:   goout [5]
##    goout high_use count
##    <int> <lgl>    <int>
##  1     1 FALSE       19
##  2     1 TRUE         3
##  3     2 FALSE       82
##  4     2 TRUE        15
##  5     3 FALSE       97
##  6     3 TRUE        23
##  7     4 FALSE       40
##  8     4 TRUE        38
##  9     5 FALSE       21
## 10     5 TRUE        32

Gender - more male students drink more (according to hypothesis) Famrel - More relatives around, less alcohol is consumed (according to hypothesis) goout - When students go out, more alcohol is consumed (according to hypothesis) freetime - more free time, more students drink alcohol (according to hypothesis)

#Graphical results

g1<- ggplot(alc, aes(x = high_use, y = famrel))+ geom_boxplot() + ggtitle("Family relationship vs alcohol consumption")
g1

g2<-ggplot(alc, aes(x = high_use, y = famrel, col = sex))+ geom_boxplot() + ggtitle("Family relationship vs alcohol consumption according to gender")
g2

g3<- ggplot(alc, aes(x = high_use, y = goout))+ geom_boxplot() + ggtitle("Going out with friends vs alcohol consumption")
g3

g4<-ggplot(alc, aes(x = high_use, y = goout, col = sex))+ geom_boxplot() + ggtitle("Going out with friends vs alcohol consumption according to gender")
g4

g5<- ggplot(alc, aes(x = high_use, y = freetime))+ geom_boxplot() + ggtitle("Freetime vs alcohol consumption")
g5

g6<-ggplot(alc, aes(x = high_use, y = freetime, col = sex))+ geom_boxplot() + ggtitle("Freetime vs alcohol consumption according to gender")
g6

Logistic regression

m <- glm(high_use ~ goout + famrel + sex+ freetime, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ goout + famrel + sex + freetime, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6569  -0.7847  -0.5274   0.8571   2.5801  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.4926     0.6994  -3.564 0.000365 ***
## goout         0.7702     0.1267   6.078 1.22e-09 ***
## famrel       -0.4483     0.1414  -3.171 0.001519 ** 
## sexM          0.9558     0.2599   3.677 0.000236 ***
## freetime      0.1120     0.1400   0.800 0.423896    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 378.47  on 365  degrees of freedom
## AIC: 388.47
## 
## Number of Fisher Scoring iterations: 4

We see here a linear model fitted for our data.Linear regression model is a simple statistical model.It is an approach for modelling the relationship between a dependent variable y and one or more exploratory variables x.The model is found by minimising the sum of the residuals.Residuals are essentially the difference between the actual observed response values and the response values that the model predicted.

The high alcohol use is the target variable and goout + famrel + sex+ freetime is the explanatory variable. The variables are evenly distributed. The summary of the variables in the data is also shown.The Residuals section of the model output breaks it down into 5 summary points - min, 1Q, median, 2Q and max. The coefficients gives estimates for the parameters of the model.In this case the p value for goout, famrel, and sex is very low. So there is a statistical relationship between going out with friends, family relationships and sex with high alcohol use.Free time has p value above 0.05 so there is no statistical relationship between high alcohol use and free time.

#Present and interpret the coefficients of the model as odds ratios and provide confidence intervals for them. Interpret the results and compare them to your previously stated hypothesis.
coef(m)
## (Intercept)       goout      famrel        sexM    freetime 
##  -2.4926487   0.7701903  -0.4483457   0.9557531   0.1119551
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp 
## Waiting for profiling to be done...
cbind(OR, CI)
##                     OR      2.5 %    97.5 %
## (Intercept) 0.08269065 0.02016869 0.3154115
## goout       2.16017719 1.69707698 2.7921949
## famrel      0.63868384 0.48200915 0.8405944
## sexM        2.60062833 1.57104372 4.3613169
## freetime    1.11846266 0.85030850 1.4743895

Odd ratios s a statistic that quantifies the strength of the association between two events. If it is greater than 1 we have a positive association and if the odd ratio is smaller than 1 its a negative association. Going out with friends, sex and free time have a positve association with high alcohol use an family relation has negative association. The results are according to the stated hypothesis. The confidence intervals are widest for the the sex variable, so its effect is the most uncertain

Predictive model Predictive power of the final logistic regression model is calculated without the statistically insignificant variable - freetime

m <- glm(high_use ~ goout + famrel + sex, data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

select(alc, goout, famrel, sex, high_use, probability, prediction) %>% tail(10)
##     goout famrel sex high_use probability prediction
## 361     3      5   M     TRUE  0.25406225      FALSE
## 362     3      4   M     TRUE  0.34478448      FALSE
## 363     1      4   M     TRUE  0.09640288      FALSE
## 364     4      3   M     TRUE  0.64356587       TRUE
## 365     2      3   M    FALSE  0.26797363      FALSE
## 366     3      4   M     TRUE  0.34478448      FALSE
## 367     2      4   M    FALSE  0.19155369      FALSE
## 368     4      4   M     TRUE  0.53888551       TRUE
## 369     4      5   M     TRUE  0.43065938      FALSE
## 370     2      4   M    FALSE  0.19155369      FALSE
# creating a confusion matrix with actual values
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   241   18
##    TRUE     62   49
 # creating a confusion matrix with predicted values
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins 
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65135135 0.04864865 0.70000000
##    TRUE  0.16756757 0.13243243 0.30000000
##    Sum   0.81891892 0.18108108 1.00000000
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))+ geom_point() + ggtitle("logistic regression model")
g 

241 students dont consume high levels of alcohol. 18 are predicted wrongly (dont drink alcohol but is predicted to be drinking high levels of alcohol). 49 students drink alcohol and prediction is correct but 62 students drink alcohol but it is predicted that they dont.

Training error

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability) 
## [1] 0.2162162
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc)) # K-fold cross-validation, cv=training data
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2216216

Training error can be calculated by adding false positives and false negatives. This is further confirmed by loss function. Here, we can see the total proportion of inaccurately classified individuals. The number is about 22% and is not very high. So our model is performing good but can be improved.

Bonus question

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
 # compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2162162
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10) # K-fold cross-validation, cv=training data
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2297297

This model has a better test set performance (0.22) compared to model in Datacample(0.26). 10-fold cross-validation gives good estimate of the actual predictive power of the model. Low value = good

Chapter 4

Describe the work you have done this week and summarize your learning.

date()
## [1] "Tue Dec 01 21:00:25 2020"

In this exercise we use Boston data from MASS-library. This dataset contains information collected by the U.S Census Service concerning housing in the area of Boston Mass. Data includes 14 variables and 506 rows

# access the MASS package and load other libraries for later analysis
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(corrplot)
## corrplot 0.84 loaded
library(dplyr)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# load the data
 data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
#Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
pairs(Boston)

There are some very interesting distributions fo variables in the plot matrix. Variable rad has high and low values so the plot shows that the values are concentrated on either side of the plot.

#Correlation matrix#

#calculating the correlation matrix and correlation plot
cor_matrix <- round(cor(Boston),digits = 2)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

Plotted correlation matrix shows that there is some high correlation between variables: * Correlation is quite clear between industrial areas (indus) and nitrogen oxides (nox). Industry adds pollution in the area. Industry seems to correlate also with variablrs like age, dis, ras and tax. * Nitrogen oxides (nox) correlations are very similar with industry (indus) * Crime rate (crim) seems to correlate with good accessibilitty to radial highways (rad) and value property (tax). * Old houses (age) and employment centers have also something common

#Scaled data# All the variables are numerical so we can use scale()-function to scale whole data set

#Standardize the dataset and print out summaries of the scaled data. How did the variables change? 
boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
class(boston_scaled)
## [1] "matrix" "array"
boston_scaled <- as.data.frame(boston_scaled)

Scaling the data makes variables look as if they are in the same range. Variables like black and tax were before scaling hundred fold compared to some other variables

#Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. 
#save the scaled crim as scaled_crim
scaled_crim <- boston_scaled$crim
#create a quantile vector of crim and print it
bins <- quantile(scaled_crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
#create a categorical variable 'crime'
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
#look at the table of the new factor crime, do not change the actual variable "crime"
crime_tab <-table(crime)
crime_tab
## crime
##      low  med_low med_high     high 
##      127      126      126      127
#remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
summary(boston_scaled)
##        zn               indus              chas              nox         
##  Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723   Min.   :-1.4644  
##  1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723   1st Qu.:-0.9121  
##  Median :-0.48724   Median :-0.2109   Median :-0.2723   Median :-0.1441  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723   3rd Qu.: 0.5981  
##  Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648   Max.   : 2.7296  
##        rm               age               dis               rad         
##  Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819  
##  1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373  
##  Median :-0.1084   Median : 0.3171   Median :-0.2790   Median :-0.5225  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596  
##  Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596  
##       tax             ptratio            black             lstat        
##  Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033   Min.   :-1.5296  
##  1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049   1st Qu.:-0.7986  
##  Median :-0.4642   Median : 0.2746   Median : 0.3808   Median :-0.1811  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332   3rd Qu.: 0.6024  
##  Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406   Max.   : 3.5453  
##       medv              crime    
##  Min.   :-1.9063   low     :127  
##  1st Qu.:-0.5989   med_low :126  
##  Median :-0.1449   med_high:126  
##  Mean   : 0.0000   high    :127  
##  3rd Qu.: 0.2683                 
##  Max.   : 2.9865

#Train and test set# Training set contains 80% of the data. 20% is in the test set

#Divide the dataset to train and test sets, so that 80% of the data belongs to the train set
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
ind
##   [1] 494  84  49 404 287 107 183 400  78 121 339  64 103 319  71 256 163  99
##  [19]  96  15 241 462 109 226 303 227 503  32 484  66 430 147 405 415 259 435
##  [37] 442 316 127 135 245  97 502  50 365 311 190 210  48 364 194 281  88 358
##  [55] 378 366 133 159 418 294 488 426 357  46 231 476 232  98 230 447 301  10
##  [73] 272 161 222 251 337 172 445  52 481 277 246  55 420 450 460 407 208 367
##  [91] 242 156 346 440 485 111 363 429 309 155  45 290 213 344 483 466 126 198
## [109] 192 279 465 128 343  76 360 206 268 373  12 338  60 173   9 397  37 306
## [127] 388 312 384 181 500 138 264  94 247  26 322 178 471  73 314 498 110  72
## [145] 118   4 160  79  89 238  62 168  44 260 506 119 225 197 391 177 474 315
## [163] 477 291  58 304   1  27 249 493 336 255 446 448 271 394 297  47 348 275
## [181] 392 390 266 106  85 368 478 347 175 501  14  28 386 278 273  67 455  38
## [199] 105 328 288 428 116 289 218 283 389 449 433 371 467 180 422 340 487  93
## [217] 321 146  33 499 369 165 234 398 403 123 374 320 335 154 310 458 381 239
## [235] 207 276 286 451 411 125 469 308 491 274 292 164 495 482 223 139 324 214
## [253] 300 461 142 114 185  77  31 323 280 434  17 209 169 468  90 236  23 353
## [271]   3 475  11 196 362 401 151 132 382 108 148   6  91 354  24 380 191 375
## [289] 258  61 240 134  53 149 395 331 436 302 216 296 254 318  16 158 112 143
## [307] 170 235 184 137 228 182 355 257  75 269 167 444  63   8 187 437 453 421
## [325] 416 424  18 439 342 463  56 299  70 217  30 152 150  43 313 454  35 117
## [343] 480  40 243 489 201 325  86 189 432 396 472 376  83 329 293 188 193 270
## [361] 492 470  82 330 419 285 104 253  51 459 215 166 248  36  39  21 359 412
## [379] 224  68 351 370 262 341  54 497 496 356 414 298 431 443 372 307 486  65
## [397] 144  25 250 350 212  87 473 130
# create train set
train <- boston_scaled[ind,]
# create test set 
test <- boston_scaled[-ind,]

#Fitting the Linear Discriminant Analysis# First the linear discriminant analysis (LDA) is fitted to the train set. The new categorical variable crime is the target variable and all the other variables of the dataset are predictor variables. After fitting we draw the LDA biplot with arrows

#Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.
#LDA = linear discriminant analysis
lda.fit <- lda(crime ~. , data = train)
#print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2400990 0.2574257 0.2500000 0.2524752 
## 
## Group means:
##                   zn      indus         chas        nox          rm        age
## low       0.87753465 -0.9142590 -0.109974419 -0.8789504  0.36777760 -0.8831785
## med_low  -0.06135478 -0.3323374 -0.083045403 -0.5938280 -0.13077447 -0.3359508
## med_high -0.39214634  0.1914074  0.195445218  0.3617505  0.06944231  0.3950127
## high     -0.48724019  1.0171096 -0.002135914  1.0619806 -0.36865958  0.7995169
##                 dis        rad        tax     ptratio      black       lstat
## low       0.8998784 -0.6740347 -0.7268737 -0.37231840  0.3799492 -0.74622469
## med_low   0.3846261 -0.5489875 -0.4914841 -0.06384729  0.3190434 -0.12430626
## med_high -0.3543280 -0.3860329 -0.2905581 -0.27443991  0.0814652  0.01392118
## high     -0.8333519  1.6382099  1.5141140  0.78087177 -0.7267693  0.80863129
##                 medv
## low       0.43807804
## med_low   0.01817895
## med_high  0.15748237
## high     -0.63132217
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.07791512  0.658987207 -0.91027801
## indus    0.04019217 -0.404697250  0.31260140
## chas    -0.06639371 -0.034523075  0.01929512
## nox      0.38094428 -0.581147211 -1.43897953
## rm      -0.08896277 -0.078984039 -0.17515729
## age      0.28213334 -0.381963365 -0.04317174
## dis     -0.06434281 -0.216992521  0.16280173
## rad      3.00303746  0.813844725 -0.02882922
## tax     -0.03349339  0.159358596  0.51972016
## ptratio  0.10112459  0.062947720 -0.24951962
## black   -0.12581367  0.006081043  0.10434986
## lstat    0.20441392 -0.195875963  0.47715509
## medv     0.19056933 -0.343632049 -0.09977097
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9481 0.0395 0.0123
#the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
#target classes as numeric
classes <- as.numeric(train$crime)
classes
##   [1] 2 1 2 4 1 2 2 4 2 1 1 2 2 3 2 1 3 1 2 3 2 4 2 3 2 3 1 3 3 1 4 3 4 4 3 4 4
##  [38] 2 3 3 2 2 1 2 3 3 2 3 2 4 1 1 1 4 4 4 3 3 4 2 4 4 4 2 3 4 3 2 3 4 1 2 2 3
##  [75] 3 2 1 3 4 1 4 2 2 1 4 4 4 4 2 4 2 3 1 4 3 2 4 4 3 3 2 1 2 1 4 3 2 1 1 1 4
## [112] 3 1 2 4 2 3 4 2 1 2 2 2 4 2 1 4 3 4 1 2 3 3 1 3 3 2 1 4 2 3 3 3 2 2 1 3 1
## [149] 1 3 2 3 2 3 1 2 3 1 4 1 4 3 4 1 1 2 1 3 2 2 1 1 4 4 3 4 1 2 1 1 4 4 3 2 1
## [186] 4 4 1 2 2 3 3 4 1 2 1 4 1 2 2 1 4 2 1 1 1 4 4 4 4 4 1 4 1 4 1 2 3 3 2 4 3
## [223] 3 4 4 2 4 3 1 3 3 4 4 2 2 2 1 4 4 2 4 1 2 2 1 3 3 4 3 2 3 2 1 4 3 2 2 2 3
## [260] 3 2 4 3 2 3 4 1 3 3 1 1 4 2 1 4 4 3 3 4 2 3 1 1 1 3 4 2 4 3 2 2 3 1 3 4 1
## [297] 4 1 2 2 3 2 3 3 2 3 3 3 2 3 3 1 1 1 1 3 3 4 2 2 1 4 4 4 4 4 3 4 1 4 1 1 2
## [334] 1 3 3 3 2 3 4 3 2 4 1 2 2 1 3 1 2 4 4 4 4 1 1 1 1 2 2 2 4 1 1 4 1 2 2 2 4
## [371] 3 3 2 1 2 3 4 4 3 1 1 4 3 1 1 3 2 2 4 2 4 4 4 1 3 1 4 3 2 1 3 1 3 3
#plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 3)

#Predicting the classes#

#Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results
#save the correct classes from test data
correct_classes <- test$crime
#remove the crime variable from test data
test <- dplyr::select(test, -crime)
#predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
#cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       18      10        2    0
##   med_low    4      13        5    0
##   med_high   0       5       20    0
##   high       0       0        0   25

Prediction were quite good. There was some errors in the middle of the range but classes low and especially high were good. Only one correct representative of high class was predicted to med_low class.

#Reload the Boston dataset and standardize the dataset (we did not do this in the Datacamp exercises, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results
#Loading and scaling Boston data
scaled_Boston <- scale(Boston)
summary(scaled_Boston)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
#calculating the euclidean distance matrix between the observation
dist_eu <- dist(scaled_Boston)
#determining the max number of clusters
cluster_max <- 15
#calculate the total within sum of squares
#K-means might produce different results every time, because it randomly 
#assigns the initial cluster centers. The function set.seed() can be used to 
#deal with that.
set.seed(123)
twcss <- sapply(1:cluster_max, function(k){kmeans(dist_eu, k)$tot.withinss})
# visualize the results
plot(1:cluster_max, twcss, type='b')

One way to determine the number of clusters is to look how the total of within cluster sum of squares (WCSS) behaves when the number of clusters changes. WCSS was calculated from 1 to 15 clusters. The optimal number of clusters is when the total WCSS drops radically. It seems that in this case optimal number of clusters is two. However we are here to learn so I decided to analyse model with four clusters.

After determining the number of clusters I run the K-means alcorithm again

#k-means clustering
km <-kmeans(dist_eu, centers = 4)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

It seems that when the data is divided to four clusters there is some clear differences in distriputions of several variables. Crim, zn, indus and blacks are variables were one can distinguish clear patterns between clusters. Some variables (rad & tax) show that sometimes 1 or 2 clusters make a clear distripution but observation of other two clusters are ambigious and there is no clear pattern to be regognised.

#BONUS: LDA using clusters as target classes#

#Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influencial linear separators for the clusters? 
#Loading and scaling Boston data
scaled_Boston <- scale(Boston)
scaled_Boston <- as.data.frame(scaled_Boston)
#colnames(scaled_Boston)
#calculating the euclidean distance matrix between the observation
dist_eu <- dist(scaled_Boston)
#k-means clustering
set.seed(123)
km <-kmeans(dist_eu, centers = 4)
cm <- as.data.frame(km$cluster)
#adding the clusters to the scaled dataset
scaled_Boston <- data.frame(scaled_Boston, clust = cm)
colnames(scaled_Boston)[15] <- "clust"
summary(scaled_Boston)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv             clust      
##  Min.   :-1.5296   Min.   :-1.9063   Min.   :1.000  
##  1st Qu.:-0.7986   1st Qu.:-0.5989   1st Qu.:2.000  
##  Median :-0.1811   Median :-0.1449   Median :3.000  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   :2.943  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683   3rd Qu.:4.000  
##  Max.   : 3.5453   Max.   : 2.9865   Max.   :4.000
#Original Boston dataset is now scaled and the result of K-means clustering is saved to the variable *clust*


#LDA = linear discriminant analysis
lda.fit.km <- lda(clust ~. , data = scaled_Boston)
#print the lda.fit object
lda.fit.km
## Call:
## lda(clust ~ ., data = scaled_Boston)
## 
## Prior probabilities of groups:
##         1         2         3         4 
## 0.1304348 0.2272727 0.2114625 0.4308300 
## 
## Group means:
##         crim         zn      indus       chas        nox         rm        age
## 1  1.4330759 -0.4872402  1.0689719  0.4435073  1.3439101 -0.7461469  0.8575386
## 2  0.2797949 -0.4872402  1.1892663 -0.2723291  0.8998296 -0.2770011  0.7716696
## 3 -0.3912182  1.2671159 -0.8754697  0.5739635 -0.7359091  0.9938426 -0.6949417
## 4 -0.3894453 -0.2173896 -0.5212959 -0.2723291 -0.5203495 -0.1157814 -0.3256000
##          dis        rad        tax     ptratio       black      lstat
## 1 -0.9620552  1.2941816  1.2970210  0.42015742 -1.65562038  1.1930953
## 2 -0.7723199  0.9006160  1.0311612  0.60093343 -0.01717546  0.6116223
## 3  0.7751031 -0.5965444 -0.6369476 -0.96586616  0.34190729 -0.8200275
## 4  0.3182404 -0.5741127 -0.6240070  0.02986213  0.34248644 -0.2813666
##          medv
## 1 -0.81904111
## 2 -0.54636549
## 3  1.11919598
## 4 -0.01314324
## 
## Coefficients of linear discriminants:
##                 LD1        LD2         LD3
## crim    -0.18113078  0.5012256  0.60535205
## zn      -0.43297497  1.0486194 -0.67406151
## indus   -1.37753200 -0.3016928 -1.07034034
## chas     0.04307937  0.7598229  0.22448239
## nox     -1.04674638  0.3861005  0.33268952
## rm       0.14912869  0.1510367 -0.67942589
## age      0.09897424 -0.0523110 -0.26285587
## dis     -0.13139210  0.1593367  0.03487882
## rad     -0.65824136 -0.5189795 -0.48145070
## tax     -0.28903561  0.5773959 -0.10350513
## ptratio -0.22236843 -0.1668597  0.09181715
## black    0.42730704 -0.5843973 -0.89869354
## lstat   -0.24320629  0.6197780  0.01119242
## medv    -0.21961575  0.9485829  0.17065360
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.7596 0.1768 0.0636
#the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
#target classes as numeric
classes <- as.numeric(scaled_Boston$clust)
#classes
#plot the lda results
plot(lda.fit.km, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit.km, myscale = 3)

#Super-bonus# 3D plot where observations color is the crime classes of the train set

model_predictors <- dplyr::select(train, -crime)
#check the dimensions
#dim(model_predictors)
#dim(lda.fit$scaling)
#matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
#3d plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

3D plot where observations color is based on the K-means clusters

model_predictors <- dplyr::select(scaled_Boston, -clust)
#check the dimensions
#dim(model_predictors)
#dim(lda.fit.km$scaling)
#matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit.km$scaling
matrix_product <- as.data.frame(matrix_product)
#3D plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = scaled_Boston$clust)

Colors of the both plots is based to four classes. It seems that K-means plot shows the different clusters more clearly than the plot that is based on the crime classification.

Chapter 5

Describe the work you have done this week and summarize your learning.

date()
## [1] "Tue Dec 01 21:00:39 2020"
# Load required libraries
library(ggplot2)
library(dplyr)
library(corrplot)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(tidyr)

Load human data

human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep=",",header=T)
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155   8

The data is from United Nations Development Programme data from Human Development Index (HDI) and Gender Inequality Index (GII) databases. We have information on demographic characteristics, including life expectancy (Life.exp), maternal mortality (Mat.mor) etc.Socioeconomic indicators related to gender equality include the ratio of female labour force rate to male labour force rate (Labo.FM) and the ratio of rate of secondary educated females to the rate of secondary educated males (Edu.FM). Lastly, we have information on the proportion of females in the parliament (Parli.F) and the gross national income (GNI) and the expected number of years of education (Edu.Exp). The data are available for 155 different countries.

##Graphical overview of the data

summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
# Histograms of the variables
human %>% 
  gather(key=var_name, value = value) %>% 
  ggplot(aes(x=value)) +
  geom_histogram() +
  facet_wrap(~var_name, scales = "free_x")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Most of the data are not normally distributed. Adolescent births rate, GNI, maternal mortality are heavily tailed with most of the values being low. Education expectation, women in parliament and women’s labour participation values are roughly normally distributed although with high kurtosis values. Education ratio of females to men and life expectancy values have multiple peaks and have complicated distributions.

# Correlations of the variables.
# Create color ramp from dull dark blue to white to dull red.
colorVector <- c("#4477AA", "#77AADD", "#FFFFFF", "#EE9988", "#BB4444")

# Print the correlation matrix
corMatrix<-cor(human) 
corMatrix %>% round(digits = 2)
##           Edu2.FM Labo.FM Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth Parli.F
## Edu2.FM      1.00    0.01    0.59     0.58  0.43   -0.66     -0.53    0.08
## Labo.FM      0.01    1.00    0.05    -0.14 -0.02    0.24      0.12    0.25
## Edu.Exp      0.59    0.05    1.00     0.79  0.62   -0.74     -0.70    0.21
## Life.Exp     0.58   -0.14    0.79     1.00  0.63   -0.86     -0.73    0.17
## GNI          0.43   -0.02    0.62     0.63  1.00   -0.50     -0.56    0.09
## Mat.Mor     -0.66    0.24   -0.74    -0.86 -0.50    1.00      0.76   -0.09
## Ado.Birth   -0.53    0.12   -0.70    -0.73 -0.56    0.76      1.00   -0.07
## Parli.F      0.08    0.25    0.21     0.17  0.09   -0.09     -0.07    1.00
# Visualize the correlation matrix


corrplot(corMatrix, method = "color", col = colorRampPalette(colorVector)(200),
         type = "upper", order = "hclust", number.cex = .8,
         addCoef.col = "black", # Add coefficient of correlation
         tl.col = "black", tl.srt = 30, # Text label color and rotation
         # Combine with significance
         #p.mat = p.mat, sig.level = 0.01, insig = "blank", 
         # hide correlation coefficient on the principal diagonal
         diag = FALSE)

The highest positive correlations were between education expectancy and life expectancy. GNI, education expectation, and life expectancy all had quite strong positive correlations to each other. Maternal mortality was strongly negatively correlated with life expectancy, education expectancy and ratio of female to male education. The variables women in parliament and ratio of women in labour force had low correlations to other variables.

##Perform PCA on non-standardized human data Given that the data used describes multiple aspects of societies, identifying bivariate associations is somewhat uninteresting. Therefore, PCA was done to identify whether the indicators presented above belong to same dimensions and if the dimensions have meaningful relationships between each other.

# perform principal component analysis (with the SVD method)
pcaHuman <- prcomp(human)

# print out a summary of PCA. One gets quite a few warnings. The first component explains a whopping 99.99 % of the variance.
s <- summary(pcaHuman)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
# rounded percetanges of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)

# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot of the principal component representation and the original variables using the first 2 components. GNI explains looks to explain pretty much all of the first principal component.
biplot(pcaHuman, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], xlim = c(-0.5, 0.2))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

The model identified 8 principle components but the first of these explains most of the variation in the data. If we look at the biplot, we can see that the only important component seems to be the gross national income. The fact that GNI overrides all the other variables is related to the fact that in the unmodified data, all the variables have different variances and the PCA treats the variable with the largest variance as the most important one. Therefore, to actually identify the real dimensions, the data was scaled and the analysis was run again.

##Perform PCA on standardized human data

humanStand <- scale(human)

pcaHumanStand <- prcomp(humanStand)

# print out a summary of PCA. One gets quite a few warnings.
s2 <- summary(pcaHumanStand)
s2
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
# rounded percetanges of variance captured by each PC. 
pca_pr2 <- round(100*s2$importance[2, ], digits = 1)

# print out the percentages of variance. Now the components explain the data much more diversly. The first one explains 52 % of the variability, with the next 3 components explaining 43 % of the variability. 
pca_pr2
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
# create object pc_lab to be used as axis labels
pc_lab2 <- paste0(names(pca_pr2), " (", pca_pr2, "%)")

# draw a biplot of the principal component representation and the original variables using the first 2 components. GNI explains looks to explain pretty much all of the first principal component.
biplot(pcaHumanStand, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab2[1], ylab = pc_lab2[2], xlim = c(-0.25, 0.25))

The results are quite different. Before there was only one component of any note. After scaling there are say 3 or 4 significant components.

We can see that the sociodemographic indicators education, GNI, life expectancy, maternal mortality and teen births load to the first principal component. That component explains over 50% of the total variability in the data. We also see that maternal mortality and teen births operate to an opposite direction when compared to the other factors. These makes sense as it would be weird if for instance GNI would increase with increasing rates of maternal mortality and teen births. These correlations were already identified above in the graphical overview step. Second, the new PCA produced another distinct principal component, which seems to describe gender equality. The gender ratio at the labour market and proportion of females in parliament relate to this component. This dimension seems to be genuinely distinct from the first as the variables related to this component have almost 90 degree angle (meaning low correlation) in the arrows when compared to the indicators influencing dimension one.I might interpret this to indicate that gender equality in labour market and parliament is not related to economic and “vital” well-being in the society.Instead, other factors (maybe values, attitudes etc.) are at play. A surprising thing is that the gender equality in education seems not to belong to the gender equality component. However, this is probably because the variable only includes information on secondary education. It might be that for overall increases well-being, it is necessary to have a population where each member has at least some education. Differences might occur if tertiary education was used as the measure of education.

I would guess most of the differences are due to scaling normalizing the data, which is an expected attribute in most analyses. Re-do the histogram of beginning to check this out.

as.data.frame(humanStand) %>% 
  gather(key=var_name, value = value) %>% 
  ggplot(aes(x=value)) +
  geom_histogram() +
  facet_wrap(~var_name, scales = "free_x")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

No, not that different actually. One major thing at least is that PCA assumes that large values mean more importance. So before the GNI which had way bigger numbers was given the most importance. This does not make too much sense, as the units are different for different variables.

##Interpretation One can with standardization more easily see the correlations between different variables. PC1 is composed mostly of educational expectation, GNI, ratio of female to male education, life expectancy and maternal mortality. PC2 is composed mostly of women and parliament and females in labour force ratio. The biplots are certainly easier to read after scaling as the different variables are on similar scales instead of wildly different ones. I would imagine that PC1 is mostly the level or resources put into people, like medicine and schooling. PC2 might be some kind of equality measure that measures how well can women attend the working life instead of being home wives.

library(FactoMineR)
data(tea)

# Explore the data. The tea dataset has 300 observations and 36 variables.
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
# Since there are so many variables one needs to split them for visualization.
gather(tea[1:12]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")  + geom_bar() + theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

gather(tea[13:24]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")  + geom_bar() + theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

gather(tea[25:36]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")  + geom_bar() + theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

# Perform multiple correspondence analysis on the tea data. Some of the columns seem to give errors, so keep only a subset of variables.
keep_columns <- c("Tea", "how", "sugar", "where", "lunch", "exciting", "price", "Sport")

# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, one_of(keep_columns))

mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.264   0.217   0.159   0.146   0.139   0.132   0.124
## % of var.             14.065  11.566   8.481   7.801   7.420   7.037   6.640
## Cumulative % of var.  14.065  25.632  34.113  41.913  49.333  56.370  63.010
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.116   0.111   0.106   0.093   0.087   0.071   0.064
## % of var.              6.189   5.903   5.675   4.958   4.665   3.800   3.401
## Cumulative % of var.  69.199  75.102  80.777  85.735  90.401  94.200  97.601
##                       Dim.15
## Variance               0.045
## % of var.              2.399
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##                         Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                    | -0.356  0.160  0.033 |  0.469  0.338  0.057 | -0.240
## 2                    | -0.122  0.019  0.013 | -0.028  0.001  0.001 | -0.313
## 3                    | -0.228  0.066  0.070 | -0.087  0.012  0.010 | -0.189
## 4                    | -0.385  0.187  0.171 |  0.057  0.005  0.004 |  0.011
## 5                    | -0.228  0.066  0.070 | -0.087  0.012  0.010 | -0.189
## 6                    | -0.404  0.206  0.074 |  0.272  0.114  0.034 | -0.063
## 7                    | -0.228  0.066  0.070 | -0.087  0.012  0.010 | -0.189
## 8                    | -0.053  0.004  0.003 |  0.036  0.002  0.001 | -0.587
## 9                    |  0.629  0.501  0.250 | -0.520  0.416  0.171 | -0.262
## 10                   |  0.771  0.752  0.298 | -0.273  0.115  0.037 | -0.844
##                         ctr   cos2  
## 1                     0.120  0.015 |
## 2                     0.205  0.083 |
## 3                     0.075  0.048 |
## 4                     0.000  0.000 |
## 5                     0.075  0.048 |
## 6                     0.008  0.002 |
## 7                     0.075  0.048 |
## 8                     0.721  0.325 |
## 9                     0.144  0.043 |
## 10                    1.494  0.357 |
## 
## Categories (the 10 first)
##                          Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black                |   0.477   2.657   0.074   4.717 |   0.239   0.813
## Earl Grey            |  -0.243   1.795   0.106  -5.635 |  -0.218   1.768
## green                |   0.350   0.639   0.015   2.128 |   0.741   3.479
## tea bag              |  -0.574   8.857   0.431 -11.355 |   0.367   4.408
## tea bag+unpackaged   |   0.339   1.706   0.052   3.958 |  -1.017  18.690
## unpackaged           |   1.827  18.981   0.455  11.665 |   0.921   5.872
## No.sugar             |   0.245   1.466   0.064   4.375 |  -0.036   0.040
## sugar                |  -0.262   1.568   0.064  -4.375 |   0.039   0.042
## chain store          |  -0.530   8.520   0.499 -12.219 |   0.291   3.119
## chain store+tea shop |   0.507   3.163   0.090   5.193 |  -1.138  19.413
##                         cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                  0.019   2.366 |  -0.955  17.688   0.299  -9.450 |
## Earl Grey              0.086  -5.070 |   0.312   4.926   0.176   7.248 |
## green                  0.068   4.503 |   0.316   0.865   0.012   1.923 |
## tea bag                0.176   7.264 |  -0.034   0.052   0.002  -0.673 |
## tea bag+unpackaged     0.472 -11.882 |  -0.139   0.474   0.009  -1.620 |
## unpackaged             0.116   5.883 |   0.523   2.579   0.037   3.339 |
## No.sugar               0.001  -0.651 |  -0.594  14.344   0.378 -10.625 |
## sugar                  0.001   0.651 |   0.635  15.333   0.378  10.625 |
## chain store            0.150   6.704 |  -0.017   0.014   0.000  -0.384 |
## chain store+tea shop   0.455 -11.666 |  -0.347   2.455   0.042  -3.552 |
## 
## Categorical variables (eta2)
##                        Dim.1 Dim.2 Dim.3  
## Tea                  | 0.107 0.105 0.299 |
## how                  | 0.623 0.503 0.039 |
## sugar                | 0.064 0.001 0.378 |
## where                | 0.677 0.512 0.133 |
## lunch                | 0.000 0.164 0.132 |
## exciting             | 0.019 0.013 0.181 |
## price                | 0.615 0.386 0.027 |
## Sport                | 0.004 0.052 0.084 |
# visualize MCA variables
plot(mca, invisible=c("ind"), habillage = "quali")

he MCA calculates distances between variables in a three-dimensional space (I think, at least). In the plot above, the distances between the variables at first two-dimensions are plotted. We can see that the variable categories opposite to each other (no/yes) are plotted to opposite quadrants of the plot. Second, we see that similar variables are plotted close to each other (for instance not breakfast and not tea time). Third, the variable categories that are well categorized by the dimensions occur further from the center of the plot than others.We can clearly see that especially dinner and lunch seem to determine to be well distinguished. We can also confirm this by looking at the bar plots of the variables: it is clear that there seems to be a relatively small group of lunch or dinner drinkers.

# Dimensions 1 and 2 of MCA correspond mostly to what package people use their tea in, where they drink and the price of the tea. Dimension 3 corresponds somewhat to what kind of tea is drank, and if they add sugar or not. 

# Visualize MCA individuals
plot(mca, invisible=c("var"), habillage = "quali")

So many people it’s hard to see different numbers in the clouds. Anyway one can for example see that in the upper right individual 190 and 208 are quite similar in their tea habits, as they are close in the plot. In upper left corner, we can see individuals whose tea drinking habits are characterized by drinking tea during lunch and evenings. In the upper right corner we have those individuals who apparently do not drink tea at all. The lower right corner represents tea drinkers that limit their consumption to dinner time. Finally, the lower left corner includes individuals that want to preserve their good night’s sleep and only drink tea in the mornings and during tea time. The edgiest group seem to be those drinking with dinner as they do not tolerate drinking tea at any other time

Chapter 6

Describe the work you have done this week and summarize your learning.

date()
## [1] "Tue Dec 01 21:00:49 2020"

Introduction

The analyses from the chapters 8 and 9 of the book Multivariate Analysis for the Behavioral Sciences (Vehkalahti & Everitt 2019) are repeated. In chapter 8, longitudinal analysis on the scores of a brief psychiatric rating scale among 40 males was conducted. Chapter 9 of the book focuses on the time evolution of weights of rats. However, the 2 datasets have been swapped

Longitudinal analysis of rat growth

Weight of rats:

setwd("~/IODS-project/data")
library(openxlsx)
library(knitr)
library(tidyverse)
library(kableExtra)
rats <- read.xlsx("rats.xlsx")
#Transform vars
rats$id <- as.factor(rats$id)
rats$group <- as.factor(rats$group)
rats$time <- as.integer(rats$time)
#First 10 rows
kable(head(rats,n=10)) %>%
  kable_styling(full_width=F)
id group time weight
1 1 1 240
1 1 8 250
1 1 15 255
1 1 22 260
1 1 29 262
1 1 36 258
1 1 43 266
1 1 44 266
1 1 50 265
1 1 57 272
str(rats)
## 'data.frame':    176 obs. of  4 variables:
##  $ id    : Factor w/ 16 levels "1","10","11",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ time  : int  1 8 15 22 29 36 43 44 50 57 ...
##  $ weight: num  240 250 255 260 262 258 266 266 265 272 ...
table(rats$id)
## 
##  1 10 11 12 13 14 15 16  2  3  4  5  6  7  8  9 
## 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
table(rats$group)
## 
##  1  2  3 
## 88 44 44
table(rats$time)
## 
##  1  8 15 22 29 36 43 44 50 57 64 
## 16 16 16 16 16 16 16 16 16 16 16

We can see that the data is in long format, so that each row of the data includes a time observation, grouped by the rat id. The rats are grouped into three separate groups, with 8 rats in the first and 4 rats in the second and third group.The groups are defined by rat diet.

labels <- c("1"="Group 1",
            "2"="Group 2",
            "3"= "Group 3")
rats %>%
  ggplot(aes(x=time,
             y=weight)) +
  facet_wrap(~group,
             labeller=labeller(
               group=labels)) +
  geom_line(size=1,aes(col=id)) +
  scale_colour_viridis_d(option='inferno') +
  theme_minimal() +
  theme(legend.position='none') 

We can see in the plot the evolution of weight of rates in time, separated by the 3 groups.The rats in group 1 have smaller weight compared to group 2 and 3. One of the rats in group 2 is bigger and could be considered as an outlier.In general, over time, all mice in all groups increase their weight.

Standardised data

rats <- 
  rats %>%
  group_by(time) %>%
  mutate(mean_weight=mean(weight),
         sd_weight=sd(weight)) %>%
  mutate(std_weight=
           (weight-mean_weight)/
           sd(weight)) %>%
  ungroup()
rats %>%
  ggplot(aes(x=time,
             y=std_weight)) +
  facet_wrap(~group,
             labeller=labeller(
               group=labels)) +
  geom_line(size=1,aes(col=id)) +
  scale_colour_viridis_d(option='inferno') +
  theme_minimal() +
  theme(legend.position='none') 

In group 1, the weight is stable over time. In group 2, one mouse has lost weight and all others have gained. In group 3, the weight is also stable with marginal increase or decrease in the mice.

visualization of a summary These individual growth profiles are of little use for a researcher interested in statistical rat weights. An usual approach to combine information is some sort of averaging over groups of interest, in this case the diet groups.

rats_s <-
  rats %>%
  group_by(group,time) %>%
  mutate(mean=mean(weight),
         sd=sd(weight),
         n=n()) %>%
  mutate(error=
           qt(0.975,df=n-1)*sd/sqrt(n)) %>%
  mutate(lower=mean-error,upper=mean+error)
  
rats_s %>%
  ggplot(aes(x=time,
             y=mean,
             col=group)) +
  geom_ribbon(aes(
    ymax=upper,
    ymin=lower,
    fill=group),
    alpha=0.3) +
  geom_line(size=1) +
  scale_color_viridis_d() +
  scale_fill_viridis_d() + 
  theme_minimal() +
  theme(legend.position='bottom') 

In the plot above, the data has been averaged over the group-specific means by time point, and 95% confidence intervals for these means were calculated. As can be seen from the figure, the rats in diet group 1 are clearly smaller than others. The confidence interval of the Group 2 is really wide, and absorps the Group 3. That seems to be related to the fact that there is this one huge rat. The outlier is also clearly visible below in the boxplot. We also see an outliers in the other groups.

rats %>%
  ggplot(aes(x=as.factor(time),
             y=weight,
             col=group)) +
  geom_boxplot() +
  #geom_jitter() +
  scale_fill_viridis_d() + 
  scale_colour_viridis_d() +
  theme_minimal() +
  theme(legend.position='bottom') 

Applying the summary approach A summary measure approach has been applied to longitudinal data. In this approach, we would like to have a look at how much the rats the differ in their weight, depending on the on the diet group (by averaging over the rat weights in each group, without taking into account the starting weight)

rats_s_2 <-
  rats %>%
  filter(time>1) %>%
  group_by(group,id) %>%
  mutate(mean=mean(weight)) %>%
  ungroup()
#filter leaves only one obs into the data
rats_s_2 %>% filter(time==8) %>%
  ggplot(aes(x=group,y=mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", 
               geom = "point",
               shape=23, size=4,
               fill = "white") +
  scale_y_continuous(
    name = 'mean weight during follow-up')

rats_s_2 %>% 
  filter(time==8) %>%
  select(c('id','group','mean')) %>%
  kable() %>% kable_styling(full_width=F)
id group mean
1 1 263.2
2 1 238.9
3 1 261.7
4 1 267.2
5 1 270.9
6 1 276.2
7 1 274.6
8 1 267.5
9 2 443.9
10 2 457.5
11 2 455.8
12 2 594.0
13 3 495.2
14 3 536.4
15 3 542.2
16 3 536.2

The boxplots of the average weights of all the rats by group reveal all groups have outliers. The outliers seem to especially affect the group mean in groups 2 and 3.

Now, remove the 3 outliers:

rats_s_2 %>% filter(id!=2) %>% #group 1
  filter(id!=12) %>% #group 2
  filter(id!=13) %>%
  filter(time==8) %>% #leave only one obs
  ggplot(aes(x=group,y=mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", 
               geom = "point",
               shape=23, size=4,
               fill = "white") +
  scale_y_continuous(
    name = 'mean weight during follow-up')

#Check the previous plot without these outliers
rats_s_t <-
  rats %>%
  filter(id!=2) %>% #group 1
  filter(id!=12) %>% #group 2
  filter(id!=13) %>% #group 3
  group_by(group,time) %>%
  mutate(mean=mean(weight),
         sd=sd(weight),
         n=n()) %>%
  mutate(error=
           qt(0.975,df=n-1)*sd/sqrt(n)) %>%
  mutate(lower=mean-error,upper=mean+error)
  
  
rats_s_t %>%
  ggplot(aes(x=time,
             y=mean,
             col=group)) +
  geom_ribbon(aes(
    ymax=upper,
    ymin=lower,
    fill=group),
    alpha=0.3) +
  geom_line(size=1) +
  scale_color_viridis_d() +
  scale_fill_viridis_d() + 
  theme_minimal() +
  theme(legend.position='bottom') 

So, now we have much prettier picture with group members resembling each other. The largest average weights seem to occur in the diet group 3, and smallest in the diet group 1.

ANOVA The statistical significance of the group differences with and without baseline weight was measured using ANOVA. (A simple t-test cant be used as we have 3 groups)

rats_no_out <- 
  rats_s_2 %>% filter(id!=2) %>% #group 1
  filter(id!=12) %>% #group 2
  filter(id!=13) %>% 
  filter(time==8) #leave only one row
anova(lm(mean~group, data=rats_no_out))
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## group      2 176917   88458  2836.4 1.687e-14 ***
## Residuals 10    312      31                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#baseline weight
baselines <- 
  rats %>% filter(time==1) %>%
  select(c('id','weight')) %>%
  rename(baseline=weight)
rats_no_out_2 <-
  inner_join(rats_no_out,baselines,by='id')
anova(lm(mean~group + baseline,
         data=rats_no_out_2))
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value    Pr(>F)    
## group      2 176917   88458 3353.2062 1.181e-13 ***
## baseline   1     74      74    2.8219    0.1273    
## Residuals  9    237      26                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm(mean~group + baseline,
         data=rats_no_out_2))
## 
## Call:
## lm(formula = mean ~ group + baseline, data = rats_no_out_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6341 -2.8915  0.1102  2.0096  7.8989 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 221.3094    28.3120   7.817 2.66e-05 ***
## group2      152.7218    18.7452   8.147 1.91e-05 ***
## group3      219.6183    29.9107   7.342 4.36e-05 ***
## baseline      0.1866     0.1111   1.680    0.127    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.136 on 9 degrees of freedom
## Multiple R-squared:  0.9987, Adjusted R-squared:  0.9982 
## F-statistic:  2236 on 3 and 9 DF,  p-value: 3.048e-13

According to the ANOVA tables and a summary of regression model, the group difference in rat weight is statistically significant. Adding the baseline weight does not affect the results and the baseline weight is not statistically significantly associated with the average weight during follow-up. The diet groups are different to each other and the diet impacts rat size.

Analysis of Longitudinal Data II: LinearMixed Effects Models for Normal ResponseVariables

#read data
setwd("~/IODS-project/data")
bprs <- read.xlsx("bprs.xlsx")
#Transform vars
bprs$treatment <- as.factor(bprs$treatment)
bprs$week <- as.integer(bprs$week)
#First 10 rows
kable(head(bprs),n=10) %>%
  kable_styling(full_width=F)
treatment subject week bprs
1 1 0 42
1 1 1 36
1 1 2 36
1 1 3 43
1 1 4 41
1 1 5 40
summary(bprs)
##  treatment   subject               week        bprs      
##  1:180     Length:360         Min.   :0   Min.   :18.00  
##  2:180     Class :character   1st Qu.:2   1st Qu.:27.00  
##            Mode  :character   Median :4   Median :35.00  
##                               Mean   :4   Mean   :37.66  
##                               3rd Qu.:6   3rd Qu.:43.00  
##                               Max.   :8   Max.   :95.00
#make a backup id
bprs$subject <- as.integer(bprs$subject)
bprs$id2 <- 
  ifelse(bprs$treatment==1,
         bprs$subject,bprs$subject+20)
unique(bprs$id2)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
unique(bprs$week)
## [1] 0 1 2 3 4 5 6 7 8
bprs$subject <- as.factor(bprs$subject)

There are 360 observations divided to two treatment groups. In total, there are 40 individuals and each has 9 measurements: baseline (0) and 8 follow-up measurements.

visualization

#Looking at how these individuals look by the treatment group:
library(wesanderson)
mypal <- wes_palette('Royal1')
bprs %>%
  ggplot(aes(x=week,
             y=bprs,
             group=id2,
             col=treatment)) +
  geom_line(size=1) +
  scale_colour_manual(values=mypal) +
  theme_minimal()

#facet

bprs %>%
  ggplot(aes(x=week,
             y=bprs,
             group=id2,
             col=treatment)) +
  geom_line(size=1) +
  scale_colour_manual(values=mypal) +
  theme_minimal() +
  facet_wrap(~treatment)

The lines are quite messy, even after facet.In general, almost all bprs values decrease over the 8 weeks and that higher bprs scores at the beginning have usually also higher scores at the end.

Linear Regression Model We would like to see if time and the treatment has an impact to the BPRS score. The model summary indicates that time decreases the BPRS score, indicating improvements in mental well-being. The treatment received seems not to have a statistically significant association with BPRS score

mod1 <- lm(
  bprs~treatment + week,
  data=bprs)
summary(mod1)
## 
## Call:
## lm(formula = bprs ~ treatment + week, data = bprs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

The model above seems okay but we know that it is obviously wrong because ofthe high autocorrelation in these type of outcomes. Let’s try a better approach and fit a multilevel linear regression model, where we model an individual intercept for each individual (random intercept model).

Random Intercept Model

library(lme4)
# a random intercept model
modref <- lmer(
  bprs ~ treatment + week + (1 | id2),
  data = bprs, REML = FALSE)
summary(modref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ treatment + week + (1 | id2)
##    Data: bprs
## 
##      AIC      BIC   logLik deviance df.resid 
##   2582.9   2602.3  -1286.5   2572.9      355 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.27506 -0.59909 -0.06104  0.44226  3.15835 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  id2      (Intercept) 97.39    9.869   
##  Residual             54.23    7.364   
## Number of obs: 360, groups:  id2, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.3521  19.750
## treatment2    0.5722     3.2159   0.178
## week         -2.2704     0.1503 -15.104
## 
## Correlation of Fixed Effects:
##            (Intr) trtmn2
## treatment2 -0.684       
## week       -0.256  0.000

The model output indicates that there is considerable between-individual variance, which was to be expected. Regarding the effects of treatment and time, the coefficients are identical to the OLS model but the standard errors differ: for treatment, the SE is higher in the random intercept model, and the SE of time is smaller.

The random intercept model basically gives each individual their own base level of BPRS. Obviously, there can be differences by individual in the evolution of the outcome as well. To model this, we will need to give each individual their own slope:

Random Intercept and Random Slope Model

# a random intercept and 
#random slope model
modref2 <- lmer(
  bprs ~ treatment + week + (week | id2),
  data = bprs, REML = FALSE)
summary(modref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ treatment + week + (week | id2)
##    Data: bprs
## 
##      AIC      BIC   logLik deviance df.resid 
##   2523.2   2550.4  -1254.6   2509.2      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4655 -0.5150 -0.0920  0.4347  3.7353 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  id2      (Intercept) 167.827  12.955        
##           week          2.331   1.527   -0.67
##  Residual              36.747   6.062        
## Number of obs: 360, groups:  id2, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  45.9830     2.6470  17.372
## treatment2    1.5139     3.1392   0.482
## week         -2.2704     0.2713  -8.370
## 
## Correlation of Fixed Effects:
##            (Intr) trtmn2
## treatment2 -0.593       
## week       -0.545  0.000
anova(modref,modref2)
## Data: bprs
## Models:
## modref: bprs ~ treatment + week + (1 | id2)
## modref2: bprs ~ treatment + week + (week | id2)
##         npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## modref     5 2582.9 2602.3 -1286.5   2572.9                         
## modref2    7 2523.2 2550.4 -1254.6   2509.2 63.663  2  1.499e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model output is relatively similar to the random intercept model. The coefficient of treatment has lowered but the effect of time is still the same size. There is a small change in standard errors. We still found evidence of mental health improving with time but no impact of treatment. The ANOVA output indicates that the random intercept + random slope provides a better model fit to the random intercept model.The log-likelihood test is statistically significant and BIC and AIC smaller.

We have now modelled the within-individual variation and found that only time seems to be related to improvements in the BPRS, whereas treatment doesn’t produce betwee-individual differences. We will still need to see if the effects of treatment are dependent on time. For this purpose I’ll modify the above model by adding an interaction term between time and treatment.

Random Intercept and Random Slope Model with interaction

# a random intercept and 
#random slope model with interaction
modref3 <- lmer(
  bprs ~ treatment + week + treatment*week +
    (week | id2),
  data = bprs, REML = FALSE)
summary(modref3)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ treatment + week + treatment * week + (week | id2)
##    Data: bprs
## 
##      AIC      BIC   logLik deviance df.resid 
##   2523.5   2554.5  -1253.7   2507.5      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4747 -0.5256 -0.0866  0.4435  3.7884 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  id2      (Intercept) 164.204  12.814        
##           week          2.203   1.484   -0.66
##  Residual              36.748   6.062        
## Number of obs: 360, groups:  id2, 40
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.9840  16.047
## treatment2       -2.2911     4.2200  -0.543
## week             -2.6283     0.3752  -7.006
## treatment2:week   0.7158     0.5306   1.349
## 
## Correlation of Fixed Effects:
##             (Intr) trtmn2 week  
## treatment2  -0.707              
## week        -0.668  0.473       
## tretmnt2:wk  0.473 -0.668 -0.707
anova(modref2,modref3)
## Data: bprs
## Models:
## modref2: bprs ~ treatment + week + (week | id2)
## modref3: bprs ~ treatment + week + treatment * week + (week | id2)
##         npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
## modref2    7 2523.2 2550.4 -1254.6   2509.2                    
## modref3    8 2523.5 2554.6 -1253.7   2507.5  1.78  1     0.1821

According to the outputs, this model is worse than the previous.

library(patchwork)
## 
## Attaching package: 'patchwork'
## The following object is masked from 'package:MASS':
## 
##     area
observed<-
  bprs %>%
  ggplot(aes(x=week,
             y=bprs,
             group=id2,
             col=treatment)) +
  geom_line(size=1) +
  ggtitle("Observed values") +
  scale_colour_manual(values=mypal) +
  theme_minimal() +
  theme(legend.position='bottom')
bprs$fitted <- fitted(modref2)
fitted<-
  bprs %>%
  ggplot(aes(x=week,
             y=fitted,
             group=id2,
             col=treatment)) +
  geom_line(size=1) +
  scale_colour_manual(values=mypal) +
  ggtitle('Fitted') +
  scale_y_continuous(name="bprs") +
  theme_minimal()+
  theme(legend.position='bottom')
observed+fitted +
  plot_layout(guides='collect') &
  theme(legend.position='bottom')

The model seems to work relatively nicely. The random slopes and intercept are clearly visible in the Fitted panel above. To conclude, we see that the BPRS scale decreases with time but we do not identify any differences by the treatment received.